Finding Frequent Structural Features among Words in Tree-Structured Documents

Internal identifier: 000998 (Main/Exploration); previous: 000997; next: 000999

Authors: Tomoyuki Uchida [Japan]; Tomonori Mogawa [Japan]; Yasuaki Nakamura [Japan]

Source:

Lecture Notes in Computer Science [ 0302-9743 ]

RBID : ISTEX:40B3A82782E1E9F30CED10BDF201F393113E4EB9

French descriptors

Pascal (Inist)
- Analyse donnée, Document électronique, Extraction information, Fouille donnée, Langage HTML, Langage SGML, Langage XML, Latex, Mot, Méthode arborescente, Structure fichier, Texte.
Wicri:
- topic: Document électronique.

English descriptors

KwdEn:
- Data analysis, Data mining, Electronic document, File structure, HTML language, Information extraction, Latex, SGML language, Text, Tree structured method, Word, XML language.

Abstract

Many electronic documents, such as SGML/HTML/XML files and LaTeX files, have tree structures. Such documents are called tree-structured documents. Many tree-structured documents contain large plain texts. In order to extract structural features among words from tree-structured documents, we consider the problem of finding frequent structured patterns among words in tree-structured documents. Let $k \geq 2$ be an integer and $(W_1, W_2, \ldots, W_k)$ a list of words sorted in lexicographical order. A consecutive path pattern on $(W_1, W_2, \ldots, W_k)$ is a sequence $\langle t_1; t_2; \ldots; t_{k-1} \rangle$ of labeled rooted ordered trees such that, for $i = 1, 2, \ldots, k-1$, (1) $t_i$ consists of only one node having the pair $(W_i, W_{i+1})$ as its label, or (2) $t_i$ has just two nodes whose degrees are one and which are labeled with $W_i$ and $W_{i+1}$, respectively. We present a data mining algorithm for finding all frequent consecutive path patterns in tree-structured documents. Then, by reporting experimental results on our algorithm, we show that our algorithm is efficient for extracting structural features from tree-structured documents.
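The definition in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm, only a checker for the shape of a consecutive path pattern as stated above. The class names `PairNode` and `TwoNodeTree` and the function `is_consecutive_path_pattern` are hypothetical. Python's default string ordering stands in for lexicographical order, and the sketch does not model which of the two nodes in case (2) is the root, since the abstract does not say.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple, Union


@dataclass(frozen=True)
class PairNode:
    """Case (1): a single node labeled with the pair (W_i, W_{i+1})."""
    label: Tuple[str, str]


@dataclass(frozen=True)
class TwoNodeTree:
    """Case (2): two nodes of degree one, labeled W_i and W_{i+1}.

    Assumption: which node is the root is not modeled here.
    """
    first: str
    second: str


Tree = Union[PairNode, TwoNodeTree]


def is_consecutive_path_pattern(words: Sequence[str], trees: Sequence[Tree]) -> bool:
    """Check that `trees` has the shape <t_1; ...; t_{k-1}> on the given words."""
    k = len(words)
    # k >= 2, words sorted (Python string order, an assumption), k - 1 trees.
    if k < 2 or list(words) != sorted(words) or len(trees) != k - 1:
        return False
    for i, tree in enumerate(trees):
        expected = (words[i], words[i + 1])
        if isinstance(tree, PairNode):
            if tree.label != expected:
                return False
        elif isinstance(tree, TwoNodeTree):
            if (tree.first, tree.second) != expected:
                return False
        else:
            return False
    return True


words = ["data", "mining", "tree"]
pattern = [PairNode(("data", "mining")), TwoNodeTree("mining", "tree")]
print(is_consecutive_path_pattern(words, pattern))
```

Here `k = 3`, so the pattern holds two trees: the first uses case (1) and the second uses case (2).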

Url:

https://api.istex.fr/ark:/67375/HCB-3GZ3WVK7-F/fulltext.pdf

DOI: 10.1007/978-3-540-24775-3_43

Affiliations:

Japan

The document in XML format

<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">Finding Frequent Structural Features among Words in Tree-Structured Documents</title>
<author><name sortKey="Uchida, Tomoyuki" sort="Uchida, Tomoyuki" uniqKey="Uchida T" first="Tomoyuki" last="Uchida">Tomoyuki Uchida</name>
</author>
<author><name sortKey="Mogawa, Tomonori" sort="Mogawa, Tomonori" uniqKey="Mogawa T" first="Tomonori" last="Mogawa">Tomonori Mogawa</name>
</author>
<author><name sortKey="Nakamura, Yasuaki" sort="Nakamura, Yasuaki" uniqKey="Nakamura Y" first="Yasuaki" last="Nakamura">Yasuaki Nakamura</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:40B3A82782E1E9F30CED10BDF201F393113E4EB9</idno>
<date when="2004" year="2004">2004</date>
<idno type="doi">10.1007/978-3-540-24775-3_43</idno>
<idno type="url">https://api.istex.fr/ark:/67375/HCB-3GZ3WVK7-F/fulltext.pdf</idno>
<idno type="wicri:Area/Istex/Corpus">001080</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Corpus" wicri:corpus="ISTEX">001080</idno>
<idno type="wicri:Area/Istex/Curation">000D30</idno>
<idno type="wicri:Area/Istex/Checkpoint">000927</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Checkpoint">000927</idno>
<idno type="wicri:doubleKey">0302-9743:2004:Uchida T:finding:frequent:structural</idno>
<idno type="wicri:Area/Main/Merge">000A07</idno>
<idno type="wicri:source">INIST</idno>
<idno type="RBID">Pascal:04-0300385</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000033</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000149</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000020</idno>
<idno type="wicri:explorRef" wicri:stream="PascalFrancis" wicri:step="Checkpoint">000020</idno>
<idno type="wicri:doubleKey">0302-9743:2004:Uchida T:finding:frequent:structural</idno>
<idno type="wicri:Area/Main/Merge">000B09</idno>
<idno type="wicri:Area/Main/Curation">000998</idno>
<idno type="wicri:Area/Main/Exploration">000998</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">Finding Frequent Structural Features among Words in Tree-Structured Documents</title>
<author><name sortKey="Uchida, Tomoyuki" sort="Uchida, Tomoyuki" uniqKey="Uchida T" first="Tomoyuki" last="Uchida">Tomoyuki Uchida</name>
<affiliation wicri:level="1"><country xml:lang="fr">Japon</country>
<wicri:regionArea>Faculty of Information Sciences, Hiroshima City University, 731-3194, Hiroshima</wicri:regionArea>
<wicri:noRegion>Hiroshima</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Japon</country>
</affiliation>
</author>
<author><name sortKey="Mogawa, Tomonori" sort="Mogawa, Tomonori" uniqKey="Mogawa T" first="Tomonori" last="Mogawa">Tomonori Mogawa</name>
<affiliation wicri:level="1"><country xml:lang="fr">Japon</country>
<wicri:regionArea>Department of Computer and Media Technologies, Hiroshima City University, 731-3194, Hiroshima</wicri:regionArea>
<wicri:noRegion>Hiroshima</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Japon</country>
</affiliation>
</author>
<author><name sortKey="Nakamura, Yasuaki" sort="Nakamura, Yasuaki" uniqKey="Nakamura Y" first="Yasuaki" last="Nakamura">Yasuaki Nakamura</name>
<affiliation wicri:level="1"><country xml:lang="fr">Japon</country>
<wicri:regionArea>Faculty of Information Sciences, Hiroshima City University, 731-3194, Hiroshima</wicri:regionArea>
<wicri:noRegion>Hiroshima</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Japon</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="s" type="main" xml:lang="en">Lecture Notes in Computer Science</title>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
<idno type="ISSN">0302-9743</idno>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Data analysis</term>
<term>Data mining</term>
<term>Electronic document</term>
<term>File structure</term>
<term>HTML language</term>
<term>Information extraction</term>
<term>Latex</term>
<term>SGML language</term>
<term>Text</term>
<term>Tree structured method</term>
<term>Word</term>
<term>XML language</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Analyse donnée</term>
<term>Document électronique</term>
<term>Extraction information</term>
<term>Fouille donnée</term>
<term>Langage HTML</term>
<term>Langage SGML</term>
<term>Langage XML</term>
<term>Latex</term>
<term>Mot</term>
<term>Méthode arborescente</term>
<term>Structure fichier</term>
<term>Texte</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Document électronique</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Abstract: Many electronic documents, such as SGML/HTML/XML files and LaTeX files, have tree structures. Such documents are called tree-structured documents. Many tree-structured documents contain large plain texts. In order to extract structural features among words from tree-structured documents, we consider the problem of finding frequent structured patterns among words in tree-structured documents. Let $k \geq 2$ be an integer and $(W_1, W_2, \ldots, W_k)$ a list of words sorted in lexicographical order. A consecutive path pattern on $(W_1, W_2, \ldots, W_k)$ is a sequence $\langle t_1; t_2; \ldots; t_{k-1} \rangle$ of labeled rooted ordered trees such that, for $i = 1, 2, \ldots, k-1$, (1) $t_i$ consists of only one node having the pair $(W_i, W_{i+1})$ as its label, or (2) $t_i$ has just two nodes whose degrees are one and which are labeled with $W_i$ and $W_{i+1}$, respectively. We present a data mining algorithm for finding all frequent consecutive path patterns in tree-structured documents. Then, by reporting experimental results on our algorithm, we show that our algorithm is efficient for extracting structural features from tree-structured documents.</div>
</front>
</TEI>
<affiliations><list><country><li>Japon</li>
</country>
</list>
<tree><country name="Japon"><noRegion><name sortKey="Uchida, Tomoyuki" sort="Uchida, Tomoyuki" uniqKey="Uchida T" first="Tomoyuki" last="Uchida">Tomoyuki Uchida</name>
</noRegion>
<name sortKey="Mogawa, Tomonori" sort="Mogawa, Tomonori" uniqKey="Mogawa T" first="Tomonori" last="Mogawa">Tomonori Mogawa</name>
<name sortKey="Mogawa, Tomonori" sort="Mogawa, Tomonori" uniqKey="Mogawa T" first="Tomonori" last="Mogawa">Tomonori Mogawa</name>
<name sortKey="Nakamura, Yasuaki" sort="Nakamura, Yasuaki" uniqKey="Nakamura Y" first="Yasuaki" last="Nakamura">Yasuaki Nakamura</name>
<name sortKey="Nakamura, Yasuaki" sort="Nakamura, Yasuaki" uniqKey="Nakamura Y" first="Yasuaki" last="Nakamura">Yasuaki Nakamura</name>
<name sortKey="Uchida, Tomoyuki" sort="Uchida, Tomoyuki" uniqKey="Uchida T" first="Tomoyuki" last="Uchida">Tomoyuki Uchida</name>
</country>
</tree>
</affiliations>
</record>
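As a rough illustration of how the record above might be read programmatically, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The record uses the `wicri:` prefix without declaring it, which a namespace-aware parser rejects, so the sketch wraps the fragment in an element that binds the prefix to a placeholder URI. That URI, the `wrapper` element, the helper names `parse_record` and `summarize`, and the shortened `SAMPLE` fragment are all assumptions made for illustration and are not part of Dilib.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI (assumption): the record does not declare one.
WICRI_NS = "urn:example:wicri"

# Shortened fragment that mirrors the structure of the record above.
SAMPLE = (
    '<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc>'
    '<titleStmt><title xml:lang="en">Finding Frequent Structural Features among Words '
    'in Tree-Structured Documents</title>'
    '<author><name>Tomoyuki Uchida</name></author>'
    '<author><name>Tomonori Mogawa</name></author>'
    '</titleStmt>'
    '<publicationStmt><idno type="doi">10.1007/978-3-540-24775-3_43</idno>'
    '</publicationStmt></fileDesc>'
    '<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en">'
    '<term>Data mining</term><term>XML language</term>'
    '</keywords></textClass></profileDesc>'
    '</teiHeader></TEI></record>'
)


def parse_record(fragment: str) -> ET.Element:
    """Parse a <record> fragment after binding the undeclared wicri: prefix."""
    wrapped = f'<wrapper xmlns:wicri="{WICRI_NS}">{fragment}</wrapper>'
    return ET.fromstring(wrapped).find("record")


def summarize(record: ET.Element) -> dict:
    """Collect the title, author names, DOI and English keywords."""
    return {
        "title": record.findtext(".//titleStmt/title"),
        "authors": [n.text for n in record.findall(".//titleStmt/author/name")],
        "doi": record.findtext(".//publicationStmt/idno[@type='doi']"),
        "keywords": [
            t.text for t in record.findall(".//keywords[@scheme='KwdEn']/term")
        ],
    }


print(summarize(parse_record(SAMPLE)))
```

The same two helpers should apply to the full record text, for example the output of the `HfdSelect` command shown in the next section, provided it is passed in as a single string.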

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Informatique/explor/SgmlV1/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000998 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000998 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Informatique
   |area=    SgmlV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:40B3A82782E1E9F30CED10BDF201F393113E4EB9
   |texte=   Finding Frequent Structural Features among Words in Tree-Structured Documents
}}

This area was generated with Dilib version V0.6.33.
Data generation: Mon Jul 1 14:26:08 2019. Site generation: Wed Apr 28 21:40:44 2021

SGML exploration server

Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.